The Caméléon Web Wrapper Engine

Authors

  • Aykut Firat
  • Stuart E. Madnick
  • Michael Siegel
Abstract

The web is rapidly becoming the universal repository of information. A major challenge is supporting the effective flow of information among the sources and services on the web and connecting them with legacy systems that were designed to operate with traditional relational databases. This paper describes a technology and infrastructure to address these needs, based on the design of a web wrapper engine called Caméléon. Caméléon extracts data from web pages using declarative specification files that define extraction rules. It is based on the relational model and is designed to work as a relational front end to web sources. ODBC drivers can be used to send SQL queries to Caméléon, which presents query results in either XML or HTML table format. Users can also easily call Caméléon from other applications (e.g. from Microsoft Excel, by using the Caméléon web query file, Caméléon.iqy). Unlike its predecessor, Grenouille, Caméléon lets users segment web pages and define independent extraction patterns for each attribute. The HTTPClient package used in Caméléon supports both GET and POST methods and can handle authentication, redirection, and cookies when connecting to web pages.
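
The idea of segmenting a page and applying an independent extraction pattern to each attribute can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `SPEC` dictionary, its `begin`/`end`/`pattern` keys, the `extract` function, and the sample page are hypothetical and do not reproduce Caméléon's actual specification file format or implementation.

```python
import re

# Hypothetical page content standing in for a fetched web page.
PAGE = """
<html><body>
<h1>Quotes</h1>
<table>
<tr><td>IBM</td><td>112.50</td></tr>
<tr><td>MSFT</td><td>68.25</td></tr>
</table>
<p>Footer 99.99</p>
</body></html>
"""

# Hypothetical declarative spec (NOT Caméléon's real format): each attribute
# names the page segment it lives in and its own extraction pattern.
SPEC = {
    "relation": "quotes",
    "attributes": {
        "ticker": {
            "begin": r"<table>",
            "end": r"</table>",
            "pattern": r"<tr><td>([A-Z]+)</td>",
        },
        "price": {
            "begin": r"<table>",
            "end": r"</table>",
            "pattern": r"<td>(\d+\.\d+)</td></tr>",
        },
    },
}


def extract(page, spec):
    """Return a list of rows (dicts), one column per attribute in the spec."""
    names = list(spec["attributes"])
    columns = []
    for name in names:
        rule = spec["attributes"][name]
        # Segment the page: keep only the text between the begin and end markers.
        segment = re.search(rule["begin"] + r"(.*?)" + rule["end"], page, re.S)
        region = segment.group(1) if segment else ""
        # Apply this attribute's own pattern inside its segment.
        columns.append(re.findall(rule["pattern"], region))
    # Align the per-attribute columns positionally into relational rows.
    return [dict(zip(names, values)) for values in zip(*columns)]


rows = extract(PAGE, SPEC)
for row in rows:
    print(row)
```

Because each attribute is matched only inside its own segment, the price-like number in the footer is not picked up. Aligning columns by position is a simplification in this sketch; it silently drops trailing values when the attributes yield different numbers of matches.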

Similar resources

AutoWrapper: automatic wrapper generation for multiple online services

A crucial challenge for information extraction from the WWW is to generate wrappers, that is, information extraction patterns or rules that apply to numerous Web sites with great diversity in both format and content. Generating wrappers manually is tedious, time-consuming, and error-prone. Recent research has successfully adapted machine learning technology to generate wrappers for semi-struct...

Gleaning answers from the web

A wide variety of valuable textual information resides on the Web, but very little is in a machine-understandable form such as XML. Instead, the content is usually embedded in HTML markup or other encodings designed for human consumption. The information extraction task is to automatically populate a database with content gleaned from information sources such as Web pages. Wrappers are an import...

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

General Strategy for Querying Web Sources in a Data Federation Environment

Modern database management systems support the inclusion and querying of nonrelational sources within a data federation environment via wrappers. Wrapper development for Web sources, however, is a convolution of code with extraction and query planning knowledge and becomes a daunting task. We use the IBM DB2 federation engine to demonstrate the challenges of incorporating Web sources into a ...

IWrap: Instant Web Wrapper Generator

In this paper, we describe an automatic Web wrapper generator that creates specification files, which contain the schema information and extraction rules for a class of Web pages. These specification files can then be used by a wrapper engine (e.g. MIT COIN Grenouille) to extract information from semi-structured Web sites. We create specification files through a WYSIWYG GUI with minimal user i...

Publication date: 2000